Abstract

Many existing speaker verification systems are reported to be
vulnerable to spoofing attacks such as speaker-adapted speech
synthesis, voice conversion, and playback. To detect these
spoofed speech signals as a countermeasure, we propose a score
level fusion approach with several different i-vector
subsystems. We show that the acoustic level Mel-frequency
cepstral coefficients (MFCC) features, the phase level modified
group delay cepstral coefficients (MGDCC) and the phonetic level
phoneme posterior probability (PPP) tandem features are
effective for the countermeasure. Furthermore, feature level
fusion of these features before i-vector modeling also enhances
the performance. A polynomial kernel support vector machine is
adopted as the supervised classifier. To enhance the
generalizability of the countermeasure, we also adopt cosine
similarity and PLDA scoring as one-class classification
methods. Combining the proposed i-vector subsystems with the
OpenSMILE baseline, which covers the acoustic and prosodic
information, further improves the final performance. The
proposed fusion system achieves 0.29% and 3.26% EER on the
development and test sets, respectively, of the database
provided by the INTERSPEECH 2015 automatic speaker verification
spoofing and countermeasures challenge.

Index Terms: speaker verification, spoofing and
countermeasures, i-vector, modified group delay cepstral
coefficients, phoneme posterior probability

1. Introduction 

The goal of speaker verification is to automatically verify the
claimed speaker identity given a segment of speech. In the past
decade, speaker verification has attracted significant research
attention with promising results [1]. However, it has recently
been reported that many existing speaker verification systems
are vulnerable to different spoofing attacks, e.g.
speaker-adapted speech synthesis, voice conversion, playback,
etc. Compared to text independent speaker verification, text
dependent speaker verification is more robust against playback
spoofing since the speech content is constrained or
pre-defined. Speaker-adapted speech synthesis and voice
conversion are the most common spoofing methods that can
convert arbitrary text or speech inputs towards the target
speaker [2]. To enhance the robustness of speaker verification
systems against spoofing attacks, different countermeasures
have been proposed. In [7], higher-level dynamic features and
voice quality assessment are used to detect those artificial
signals. Furthermore, the modified group delay cepstral
coefficients (MGDCC) feature has been proposed to distinguish
between the original and the spoofed speech signals in the
phase domain [8]. This approach is based on the fact that the
phase information of synthetic spoofed speech is typically
different from that of real human articulated speech, while the
human auditory system is less sensitive to this difference.
Long term temporal modulation features derived from the
magnitude or phase spectrum have also been proposed to detect
synthetic speech [9].

Total variability i-vector modeling has been widely used in
speaker verification due to its excellent performance, compact
representation and small model size [10, 11]. In this work, we
apply the recently proposed generalized i-vector framework with
both the acoustic and phonetic features to the countermeasure
task.

Figure 1 shows an overview of our anti-spoofing countermeasure
system. First, there are several i-vector subsystems using
different features, namely the acoustic level Mel-frequency
cepstral coefficients (MFCC) features, the phase level MGDCC
features, the phonetic level phoneme posterior probability (PPP)
tandem features [14, 16] and their feature level combinations.
Second, we also applied the openSMILE toolkit to perform
utterance level acoustic and prosodic feature extraction, since
we believe that the spoofed speech signal may have different
prosodic patterns. Third, after feature normalization, multiple
classification methods, e.g. cosine scoring, K-nearest neighbor
(KNN), simplified PLDA [18] and Support Vector Machine (SVM),
are employed as the back end. Finally, score level fusion is
performed to further enhance the overall system performance.

The remainder of the paper is organized as follows. The corpus
and the proposed algorithms are explained in Sections 2 and 3,
respectively. Experimental results and discussions are
presented in Section 4, while conclusions are provided in
Section 5.

2. Corpus 

The database used to evaluate the proposed methods is based
upon a standard dataset of both genuine and spoofed speech.
Genuine speech is without significant channel or background
noise effects and includes 106 speakers (45 male, 61 female),
while spoofed speech is obtained by applying several spoofing
algorithms to the genuine speech [19]. The training data set
(25 speakers, 3750 genuine utterances and 12635 spoofed
utterances) is for model training, while the development data
set (35 speakers, 3497 genuine utterances and 49875 spoofed
utterances) is used to evaluate the system performance and tune
the parameters. Finally, the testing data set (46 speakers,
193404 utterances) with unknown types of spoofing attacks is
provided to obtain the official submission scores. The details
of the database and evaluation protocol are provided in [19].

Figure 1: The system overview


3. Methods 

From Figure 1 we can see that there are four different
features, namely MFCC i-vectors, MFCC-PPP i-vectors, MGDCC-PPP
i-vectors and openSMILE feature vectors, followed by the same
feature normalization, classification and score level fusion
pipeline. We first present the proposed features in Section
3.1. Then Section 3.2 describes the supervised classification
and score level fusion methods.

3.1. Features 

3.1.1. The i-vector framework

In the total variability space, there is no distinction between
the speaker effects and the channel effects. Rather than
separately using the eigenvoice matrix $V$ and the eigenchannel
matrix $U$ [20], the total variability space simultaneously
captures the speaker and channel variabilities. Given a
$C$-component GMM UBM model $\lambda$ with
$\lambda_c = \{p_c, \mu_c, \Sigma_c\}$, $c = 1, \dots, C$, and an
utterance with an $L$-frame feature sequence
$\{y_1, \dots, y_L\}$, the zero-order and centered first-order
Baum-Welch statistics on the UBM are calculated as follows:

$$N_c = \sum_{t=1}^{L} P(c \mid y_t, \lambda) \tag{1}$$

$$F_c = \sum_{t=1}^{L} P(c \mid y_t, \lambda)\,(y_t - \mu_c) \tag{2}$$

where $c = 1, \dots, C$ is the GMM component index and
$P(c \mid y_t, \lambda)$ is the occupancy posterior probability
for $y_t$ on $\lambda_c$. The corresponding centered mean
supervector $\tilde{F}$ is generated by concatenating all the
$\tilde{F}_c$ together:

$$\tilde{F}_c = \frac{\sum_{t=1}^{L} P(c \mid y_t, \lambda)\,(y_t - \mu_c)}{\sum_{t=1}^{L} P(c \mid y_t, \lambda)} \tag{3}$$

Then the centered mean supervector $\tilde{F}$ is projected as
follows:

$$\tilde{F} \rightarrow Tx, \tag{4}$$

where $T$ is a rectangular low rank total variability matrix and
$x$ is the so-called i-vector.
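
The statistics described above can be sketched in a few lines. The following is a minimal illustration, assuming a diagonal-covariance GMM and plain Python lists; the function names (`baum_welch_stats`, `normalized_supervector`) and the toy two-component model are hypothetical and not taken from the system described here. Training the matrix T and extracting the i-vector itself are not shown.

```python
import math

def log_gauss_diag(y, mean, var):
    """Log density of a diagonal-covariance Gaussian evaluated at y."""
    return -0.5 * sum(
        math.log(2.0 * math.pi * v) + (yi - m) ** 2 / v
        for yi, m, v in zip(y, mean, var)
    )

def baum_welch_stats(frames, weights, means, variances):
    """Return (N, F): zero-order and centered first-order statistics."""
    C, D = len(weights), len(means[0])
    N = [0.0] * C
    F = [[0.0] * D for _ in range(C)]
    for y in frames:
        # log of p_c * N(y | mu_c, Sigma_c) for every component c
        logp = [math.log(weights[c]) + log_gauss_diag(y, means[c], variances[c])
                for c in range(C)]
        mx = max(logp)  # subtract the max for numerical stability
        lik = [math.exp(lp - mx) for lp in logp]
        total = sum(lik)
        for c in range(C):
            post = lik[c] / total  # occupancy posterior P(c | y_t, lambda)
            N[c] += post
            for d in range(D):
                F[c][d] += post * (y[d] - means[c][d])
    return N, F

def normalized_supervector(N, F, eps=1e-10):
    """Concatenate F_c / N_c over all components (centered mean supervector)."""
    return [f / max(n, eps) for n, Fc in zip(N, F) for f in Fc]

# Toy example: three 2-dimensional frames, two components (assumed values).
frames = [[0.1, -0.2], [2.9, 3.1], [0.0, 0.3]]
weights = [0.5, 0.5]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [1.0, 1.0]]
N, F = baum_welch_stats(frames, weights, means, variances)
supervector = normalized_supervector(N, F)
```

The zero-order statistics sum to the number of frames, and the supervector has C times D entries, which is the dimension that T projects from.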

3.1.2. The MFCC i-vector 

The MFCC i-vector is extracted by the aforementioned i-vector
framework with the acoustic level MFCC features. For cepstral
feature extraction, a 25 ms Hamming window with 10 ms shifts was
adopted. Each utterance was converted into a sequence of
36-dimensional feature vectors, each consisting of 18 MFCC
coefficients and their first order derivatives. We employed the
English phoneme recognizer to perform voice activity detection
(VAD) by simply dropping all frames that are decoded as silence
or speaker noise.

3.1.3. The MFCC-PPP i-vector 

It has been reported that combining the phonetic level phoneme
posterior probability based tandem features with the acoustic
level MFCC features at the feature level significantly enhances
the performance of speaker verification and language
identification. In this work, the MFCC-PPP i-vector is extracted
in the same way as in [14], following the generalized i-vector
framework. We employed the multilayer perceptron (MLP) based
phoneme recognizer with a provided English acoustic model
trained on the TIMIT database to perform the phoneme decoding.
The GMM model size and the tandem feature dimensionality are 512
and 32, respectively.

3.1.4. The MGDCC-PPP i-vector 

The MGDCC-PPP i-vector is calculated the same way as the
MFCC-PPP i-vector except that here we replace the acoustic
level MFCC features with the phase domain MGDCC features. The
MGDCC feature is a kind of frame-level feature focusing on the
speech phase characteristics. It has been shown that phase
domain features are effective for anti-spoofing
countermeasures [8]. In order to calculate the MGDCC feature,
we need to obtain the modified group delay function phase
spectrum (MGDFPS) [22] first.

| ID | System | LIBLINEAR | LIBPOLY | Cosine scoring | KNN | Simplified PLDA | Two stage PLDA |
|----|--------|-----------|---------|----------------|-----|-----------------|----------------|
| 1 | MFCC i-vector | 8.46 | 6.63 | 16.1 | 9.95 | 12.01 | 17.84 |
| 2 | PPP i-vector | 1.72 | 1.26 | 3.6 | 3.4 | 2.29 | - |
| 3 | MFCC-PPP i-vector | 1.86 | 1.06 | 2.86 | 2.46 | 1.89 | 10.18 |
| 4 | MGDCC-MFCC-PPP i-vector | 2.97 | 2.06 | 6.52 | 3.43 | 3.95 | 17.79 |
| 5 | OpenSMILE | 2.03 | 1.57 | - | - | - | - |
| 6 | Fusion 1+2+3+4 | - | - | 1.63 | 1.37 | 1.09 | - |
| 7 | Fusion 1+2+3+4+5 | 0.54 | 0.29 | - | - | - | - |

Table 1: Performance (EER, %) of the proposed methods on the development data

| System | S1 | S2 | S3 | S4 | S5 | S6 | S7 | S8 | S9 | S10 | Average |
|--------|----|----|----|----|----|----|----|----|----|-----|---------|
| Fusion 1+2+3+4+5-LIBPOLY | 0.1137 | 1.0332 | 0.0482 | 0.0412 | 0.6614 | 0.7112 | 0.2297 | 0.0108 | 0.1336 | 29.6649 | 3.265 |

Table 2: Performance (EER, %) of the fusion systems with different spoofing conditions on the testing data

Given the data $x(n)$ of a short time window, the MGDFPS
spectrum $\tau_{\rho,\gamma}(\omega)$ is calculated as follows
[22]:

$$\tau_{\rho}(\omega) = \frac{X_R(\omega)\,Y_R(\omega) + Y_I(\omega)\,X_I(\omega)}{|S(\omega)|^{2\rho}} \tag{5}$$

$$\tau_{\rho,\gamma}(\omega) = \frac{\tau_{\rho}(\omega)}{|\tau_{\rho}(\omega)|}\,|\tau_{\rho}(\omega)|^{\gamma} \tag{6}$$

where $X(\omega)$ and $Y(\omega)$ are the Fourier transforms of
the speech signal $x(n)$ and $n\,x(n)$; $X_R(\omega)$ and
$X_I(\omega)$ are the real and imaginary parts of $X(\omega)$;
$Y_R(\omega)$ and $Y_I(\omega)$ are the real and imaginary parts
of $Y(\omega)$, respectively. $|S(\omega)|^2$ is calculated by
applying a smoothing over $X(\omega)$ [22]. After applying the
Mel-frequency filter banks and the Discrete Cosine Transform,
the MGDCC feature is obtained. More details can be found in [8].
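
As a rough illustration of the modified group delay computation for one frame, the sketch below follows the common formulation in which the smoothed power spectrum is raised to a power rho in the denominator and the result is compressed by a power gamma while keeping its sign. Several choices are assumptions made only for this example: a naive DFT instead of an FFT, a circular moving average as the smoothing step (the text only says that a smoothing is applied), and the values rho = 0.4 and gamma = 0.9. The Mel filter bank and DCT steps are not shown.

```python
import cmath
import math

def dft(x):
    """Naive O(n^2) discrete Fourier transform (adequate for a short frame)."""
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * math.pi * k * m / n) for k in range(n))
            for m in range(n)]

def moving_average(values, width=5):
    """Circular moving average; a stand-in for the smoothing step."""
    half = width // 2
    n = len(values)
    return [sum(values[(i + j) % n] for j in range(-half, half + 1)) / (2 * half + 1)
            for i in range(n)]

def modified_group_delay(frame, rho=0.4, gamma=0.9, eps=1e-12):
    """Modified group delay per frequency bin for one windowed frame."""
    X = dft(frame)                                    # transform of x(n)
    Y = dft([n * v for n, v in enumerate(frame)])     # transform of n * x(n)
    S2 = moving_average([abs(v) ** 2 for v in X])     # smoothed |X|^2, i.e. |S|^2
    result = []
    for x, y, s2 in zip(X, Y, S2):
        # (X_R Y_R + Y_I X_I) / |S|^(2 rho)
        tau = (x.real * y.real + y.imag * x.imag) / (max(s2, eps) ** rho)
        # sign(tau) * |tau|^gamma
        result.append(math.copysign(abs(tau) ** gamma, tau))
    return result

# Sanity check input: a unit impulse delayed by 3 samples.
impulse = [0.0, 0.0, 0.0, 1.0, 0.0, 0.0, 0.0, 0.0]
mgd = modified_group_delay(impulse)
```

For a unit impulse delayed by three samples the unmodified group delay is 3 in every bin, so each output value equals 3 raised to the power gamma, which gives a quick check that the numerator is implemented correctly.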

3.1.5. The OpenSMILE feature vector 

The OpenSMILE feature is a 6373 dimensional utterance level
feature vector extracted by the OpenSMILE toolkit using the
configuration file provided by the 2014 Paralinguistic
Challenge [23]. Since various kinds of features, such as MFCC,
loudness, auditory spectrum, voicing probability, F0, F0
envelope, jitter, and shimmer, are included, this feature set
can capture spoofing information at both the acoustic and
prosodic levels. In our system, it served as a baseline as well
as a supplement to the i-vector subsystems.

3.2. Back-end modeling 

After feature vectors are extracted, we apply different classifi¬ 
cation methods for the back-end modeling. 

3.2.1. The K-nearest neighbor classification (KNN) 

KNN is a non-parametric multi-class classifier. The utterances
in the training set are divided into a human set and a spoofed
set. For each test utterance $x_t$, the K nearest neighboring
utterances are found in the training set and the score is
calculated based on the class distribution of these K nearest
neighbors.
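
One simple way to turn the class distribution of the neighbors into a score is the fraction of neighbors that are human. The sketch below assumes Euclidean distance and that particular score definition; neither is specified in the text, and the function name `knn_score` is hypothetical.

```python
import math

def knn_score(test_vec, train_vecs, train_labels, k=3):
    """Fraction of the k nearest training utterances labeled 'human'."""
    dists = sorted(
        (math.dist(test_vec, v), label)
        for v, label in zip(train_vecs, train_labels)
    )
    nearest = [label for _, label in dists[:k]]
    return nearest.count("human") / len(nearest)

# Toy 2-dimensional vectors standing in for utterance-level features.
train_vecs = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0], [0.0, 0.1]]
train_labels = ["human", "human", "spoof", "spoof", "human"]
score_near_human = knn_score([0.05, 0.05], train_vecs, train_labels, k=3)
score_near_spoof = knn_score([1.0, 1.05], train_vecs, train_labels, k=3)
```

A higher score means the test utterance sits in a neighborhood dominated by genuine speech.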


3.2.2. The cosine similarity scoring 

The cosine similarity between two vectors is calculated as
follows:

$$\mathrm{similarity}(x, y) = \frac{x^{\top} y}{\|x\|_2 \, \|y\|_2} \tag{7}$$

In our system, a mean vector of all the human utterances in
the training data set is calculated. For each test utterance, the
score is computed as the cosine similarity between itself and the
human class mean vector.
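
The one-class scoring rule above needs only the human training vectors. A minimal sketch, with hypothetical function names and toy 2-dimensional vectors in place of real i-vectors:

```python
import math

def cosine_similarity(x, y):
    """Dot product of x and y divided by the product of their L2 norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

def mean_vector(vectors):
    """Element-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cosine_score(test_vec, human_train_vecs):
    """Score a test utterance against the human class mean vector."""
    return cosine_similarity(test_vec, mean_vector(human_train_vecs))

human_train_vecs = [[1.0, 0.0], [1.0, 0.2]]   # class mean is [1.0, 0.1]
aligned = cosine_score([2.0, 0.2], human_train_vecs)      # parallel to the mean
orthogonal = cosine_score([-0.1, 1.0], human_train_vecs)  # orthogonal to the mean
```

Because no spoofed data enters the model, the score does not depend on which spoofing types were seen in training.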

3.2.3. PLDA modeling 

We first applied simplified PLDA modeling [18] as the back end,
assuming that there are six special speakers (five spoofing
channels plus one human channel), each representing a spoofing
type or the original genuine speech. Furthermore, we also
adopted the two-subspace (speaker subspace and spoofing
subspace) PLDA to model the i-vectors. The standard log
likelihood ratio based hypothesis test is employed for the
scoring [18, 24].

3.2.4. Support Vector Machine 

We formulated the anti-spoofing countermeasure as a two-class
classification task for SVM modeling. The linear kernel
LIBLINEAR [25] and its polynomial kernel extension LIBPOLY [26]
are adopted as the back-end SVM classifiers. We applied min/max
normalization (range -1 to +1) to each feature dimension on the
training, development and test sets, with parameters computed
only from the training data.
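
The normalization step can be sketched as follows. The function names are hypothetical, and mapping a constant dimension to 0 is an assumption made here to avoid division by zero. SVM training itself is not shown.

```python
def fit_min_max(train_vecs):
    """Per-dimension minimum and maximum, computed on training data only."""
    cols = list(zip(*train_vecs))
    return [min(c) for c in cols], [max(c) for c in cols]

def apply_min_max(vec, mins, maxs):
    """Map each dimension to [-1, +1] using the training-set range."""
    out = []
    for v, lo, hi in zip(vec, mins, maxs):
        if hi == lo:
            out.append(0.0)  # constant dimension: assumed convention
        else:
            out.append(2.0 * (v - lo) / (hi - lo) - 1.0)
    return out

train_vecs = [[0.0, 10.0], [4.0, 20.0], [2.0, 15.0]]
mins, maxs = fit_min_max(train_vecs)
train_norm = [apply_min_max(v, mins, maxs) for v in train_vecs]
dev_norm = apply_min_max([6.0, 5.0], mins, maxs)  # may fall outside [-1, +1]
```

Since the parameters come only from the training data, development or test vectors can land outside the nominal range, as `dev_norm` does here.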

3.2.5. Score fusion

We simply employed the weighted summation fusion approach 
at the score level to further enhance the performance. The fu¬ 
sion weights were tuned on the development data set. 
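
A minimal sketch of weighted summation fusion with weight tuning on development scores. It assumes that the tuning criterion is the EER and that a coarse grid search over a single weight for two subsystems is enough for illustration; the text does not state how the weights were searched. All function names are hypothetical.

```python
def fuse_scores(subsystem_scores, weights):
    """Weighted sum over subsystems; each inner list follows the same trial order."""
    if len(subsystem_scores) != len(weights):
        raise ValueError("one weight per subsystem is required")
    return [sum(w * s for w, s in zip(weights, trial))
            for trial in zip(*subsystem_scores)]

def equal_error_rate(genuine, spoofed):
    """Approximate EER by sweeping thresholds over all observed scores."""
    best_gap, eer = float("inf"), 1.0
    for thr in sorted(set(genuine) | set(spoofed)):
        frr = sum(s < thr for s in genuine) / len(genuine)    # genuine rejected
        far = sum(s >= thr for s in spoofed) / len(spoofed)   # spoofed accepted
        if abs(far - frr) < best_gap:
            best_gap, eer = abs(far - frr), (far + frr) / 2.0
    return eer

def tune_two_way_weight(scores_a, scores_b, labels, steps=11):
    """Grid search for w in [0, 1] with fused = w * a + (1 - w) * b."""
    best_w, best_eer = 0.0, float("inf")
    for i in range(steps):
        w = i / (steps - 1)
        fused = fuse_scores([scores_a, scores_b], [w, 1.0 - w])
        genuine = [s for s, lab in zip(fused, labels) if lab == 1]
        spoofed = [s for s, lab in zip(fused, labels) if lab == 0]
        eer = equal_error_rate(genuine, spoofed)
        if eer < best_eer:
            best_w, best_eer = w, eer
    return best_w, best_eer

# Toy development scores: label 1 is genuine, label 0 is spoofed.
labels = [1, 1, 0, 0]
scores_a = [1.0, 0.9, 0.1, 0.0]
scores_b = [0.2, 0.8, 0.9, 0.1]
best_w, best_eer = tune_two_way_weight(scores_a, scores_b, labels)
```

With more than two subsystems the same idea extends to a search over a weight vector, at a cost that grows quickly with the number of subsystems.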

4. Experimental results 

The results of our four subsystems on the development data are
shown in Table 3. We can observe that feature level fusion with
the PPP feature improves the performance. Compared to the MFCC
i-vector subsystem (EER = 6.63%), the EER of the MFCC-PPP
i-vector subsystem is reduced to 1.06%. On the other hand, the
OpenSMILE feature outperformed the MFCC i-vector subsystem,
which might be due to the inclusion of prosodic level
information.

| Methods | EER (%) |
|---------|---------|
| MFCC i-vector | 6.63 |
| MFCC-PPP i-vector | 1.06 |
| MGDCC-PPP i-vector | 2.23 |
| OpenSMILE | 1.57 |

Table 3: Performance of the four subsystems on the development data


| Train set | Test set | PLDA | LIBLINEAR |
|-----------|----------|------|-----------|
| human+spoof[2,3,4,5] | human+spoof[1] | 3.57 | 3.4 |
| human+spoof[1,3,4,5] | human+spoof[2] | 4.8 | 7.69 |
| human+spoof[1,2,4,5] | human+spoof[3] | 0.2 | 0.71 |
| human+spoof[1,2,3,5] | human+spoof[4] | 0.2 | 0.66 |
| human+spoof[1,2,3,4] | human+spoof[5] | 4.49 | 11.81 |

Table 5: Performance (EER, %) of the LIBLINEAR and the simplified PLDA backends on the unknown spoofing testing conditions


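| Polynomial kernel degree | 1 (LIBLINEAR) | 2 (LIBPOLY) | 3 | 4 | 10 |
|--------------------------|---------------|-------------|---|---|----|
| EER (%) | 1.86 | 1.06 | 1.03 | 1.00 | 2.32 |

Table 4: Performance of the MFCC-PPP i-vector SVM subsystems with different polynomial kernel degrees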


Furthermore, to obtain a robust countermeasure system,
different back-end classification techniques were evaluated.
Table 1 shows the performance on the development data. Among
these six classification methods, LIBPOLY achieves the best
performance with 0.29% EER on the development data. The
improvement of LIBPOLY over LIBLINEAR motivated us to further
increase the SVM polynomial kernel degree. Table 4 shows that
the EER decreases only slightly up to degree 4 (1.00%) and
rises to 2.32% at degree 10, which suggests that an SVM with a
high degree polynomial kernel may lead to overfitting.

With regard to the PLDA back ends, the results show that the
simplified PLDA tends to be more robust against unseen spoofing
attacks. As shown in Table 5, we simulated unknown spoofing
attacks by using four kinds of spoofed utterances in training
and the remaining one in testing. Although its performance was
only as good as LIBLINEAR against familiar spoofing attacks, it
outperformed LIBLINEAR on the unseen testing data, especially
where the unknown attacks were related to speech synthesis
(index 3 and 4). The two stage PLDA only achieved moderate
results in Table 1, which might be because the total number of
speakers in the training data is limited (25) and the speaker
subspace may not be orthogonal to the spoofing subspace.

Table 2 presents our fusion system results for each individual
spoofing condition on the test data. Here S1 to S5 are known
attacks and S6 to S10 are unknown attacks. Our system performed
well on all attacks except S10, on which most challenge
participants obtained unsatisfactory results.

Finally, our fusion system (system 7) achieved 0.38% and 
6.15% EER against known and unknown attacks, respectively. 
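
The known and unknown averages follow directly from the per-condition EERs in Table 2, assuming an unweighted mean over the five conditions in each group. A short check (variable names are illustrative):

```python
# Per-condition EER (%) of the fusion system on the test data, from Table 2.
eer_by_condition = {
    "S1": 0.1137, "S2": 1.0332, "S3": 0.0482, "S4": 0.0412, "S5": 0.6614,
    "S6": 0.7112, "S7": 0.2297, "S8": 0.0108, "S9": 0.1336, "S10": 29.6649,
}

known = ["S1", "S2", "S3", "S4", "S5"]
unknown = ["S6", "S7", "S8", "S9", "S10"]

def mean_eer(conditions):
    """Unweighted mean EER over the given spoofing conditions."""
    return sum(eer_by_condition[c] for c in conditions) / len(conditions)

known_avg = mean_eer(known)              # about 0.38
unknown_avg = mean_eer(unknown)          # about 6.15
overall_avg = mean_eer(known + unknown)  # about 3.265
```

The unknown average is dominated by S10: without it, the mean over S6 to S9 would be well below 1%.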

5. Conclusions 

This paper presents an anti-spoofing countermeasure system
based on a multi-feature and multi-subsystem fusion approach.
By fusing the phonetic level phoneme posterior probability
tandem features with the acoustic level MFCC features or the
phase level MGDCC features, the system performance is
significantly enhanced. Combining the proposed i-vector
subsystems with the OpenSMILE baseline, which covers the
acoustic and prosodic level information, further improves the
final performance. For the back-end modeling, the two-class
support vector machine outperforms the one-class cosine
similarity and PLDA scoring on the development data, where the
spoofing attack types are known. The one-class scoring methods
achieve more robust performance on the unseen testing data,
where the spoofing conditions are unknown.